Forecasting with Kats and InfluxDB


The Python library Kats can be used for time-series analysis, anomaly detection and forecasting. In this tutorial, we’ll learn how to use Kats to forecast data from the InfluxDB time-series database.

We’ll also use the InfluxDB Python Client library to query data from InfluxDB and convert the data to a Pandas DataFrame to make working with the time-series data easier. Then we’ll make our forecast, and finally we’ll write the forecast back to InfluxDB.

The purpose of this tutorial is to help users get started using InfluxDB, the InfluxDB Python library and Kats together to make time-series forecasts.

It isn’t meant to help you determine which model to use; for that, Kats has a backtesting module that helps you evaluate which model best fits your time-series data.

This tutorial was executed on a macOS system with Python 3 installed via Homebrew. I recommend setting up additional tools like virtualenv, pyenv or conda-env to simplify Python and client installations. Otherwise, you can find the full requirements here.

This tutorial also assumes that you have created a Free Tier InfluxDB Cloud account. It also assumes that you have created a bucket and generated an authentication token for the Python client.

This section highlights some, but not all, of the key features and tools of the Kats library. For time-series analysis, Kats offers users the ability to extract important features from their time-series data with the TsFeatures method. These features can help you determine which parameters to use when tuning your forecast models.

They can also help you evaluate which forecasting algorithms would work best for your data and how difficult it will be to generate accurate forecasts from it. You can also use Kats to perform automatic hyperparameter tuning.
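For example, here’s a minimal sketch of feature extraction with TsFeatures (the synthetic series below just stands in for your own data):

import numpy as np
import pandas as pd
from kats.consts import TimeSeriesData
from kats.tsfeatures.tsfeatures import TsFeatures

# A synthetic daily series standing in for real data
df = pd.DataFrame({
    "time": pd.date_range("2022-01-01", periods=90, freq="D"),
    "value": np.sin(np.linspace(0, 12, 90)) + np.random.default_rng(0).normal(0, 0.1, 90),
})
ts = TimeSeriesData(df)

# Returns a dictionary of features (trend strength, entropy and so on)
features = TsFeatures().transform(ts)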

Hyperparameters are parameters of a machine learning model that control the learning process. Popular time-series forecasting algorithms, including ARIMA models, require hyperparameter tuning. Many of these models work by searching for the optimal parameter values, the global maximum or minimum of some objective, to maximize prediction accuracy.

Users must supply initial values, and the algorithms search for the optimal parameters from there. These user-supplied initial values are often best guesses. However, if they are too far from the optimal values, the algorithms can get stuck in local maxima and minima.

Hyperparameter tuning is the process of supplying different initial parameters to a model to find the global maxima and minima. The simplest approach is to try a series of combinations for each parameter, which is referred to as a grid search; this is also the approach that Kats offers. Finally, Kats has a backtesting module that helps users determine which forecasting method to use by letting you compare error metrics across different forecasting models.
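To illustrate the grid-search idea, here’s a generic sketch (not Kats’ own tuning module; the evaluate function is a placeholder you’d supply):

import itertools

def grid_search(evaluate, param_grid):
    # Try every combination of hyperparameter values and keep the best.
    # evaluate maps a params dict to an error score (lower is better).
    best_params, best_score = None, float("inf")
    for combo in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), combo))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# For example, over ARIMA orders:
# best, err = grid_search(my_arima_error, {"p": [0, 1, 2], "d": [0, 1], "q": [0, 1, 2]})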

Kats offers a lot of methods for time-series anomaly detection. Change point detection is the process of identifying change points: abrupt shifts in the statistical properties of time-series data, including variance, mean, correlation, spatial density and more.

Change points can often signify an anomaly. Kats offers a variety of methods for change point detection, including the ProphetTrendDetector method for automatic change point detection. Kats also offers dynamic time warping (DTW) to compare the similarities and differences between two series to identify anomalies.
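As a quick illustration, here’s a minimal change point detection sketch using Kats’ CUSUMDetector (the synthetic series contains a deliberate mean shift):

import numpy as np
import pandas as pd
from kats.consts import TimeSeriesData
from kats.detectors.cusum_detection import CUSUMDetector

# A synthetic series whose mean jumps halfway through
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "time": pd.date_range("2022-01-01", periods=100, freq="D"),
    "value": np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)]),
})
ts = TimeSeriesData(df)

detector = CUSUMDetector(ts)
change_points = detector.detector()  # should flag the shift around day 50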

Finally, Kats also provides a lot of forecasting options like ARIMA/SARIMA (seasonal autoregressive integrated moving average), Holt-Winters, Theta, Prophet, Neural Prophet, Linear, LSTM (long short-term memory), RNN (recurrent neural network), VAR (vector autoregression) and more. It even has a Kats Ensemble method that allows you to combine various statistical time-series forecasting models.
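Most of these models share the same fit/predict pattern. Here’s a quick sketch with the Prophet wrapper, reusing a TimeSeriesData object ts like the ones built above (the seasonality setting is purely illustrative):

from kats.models.prophet import ProphetModel, ProphetParams

params = ProphetParams(seasonality_mode="multiplicative")  # illustrative setting
model = ProphetModel(ts, params)  # ts is a TimeSeriesData object
model.fit()
forecast = model.predict(steps=30, freq="D")  # a 30-day daily forecast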

The GM Ensemble method allows you to combine exponential smoothing with RNNs. In this way, Kats aims to be a one-stop solution for time-series data science problems. In this tutorial, we’ll make a forecast with the pre-trained RNN-GME global model from Kats. Hybrid methods like this have proven to be highly effective. This particular model was trained on 100,000 time series from the M4 dataset (the fourth Makridakis Competition, a highly regarded time-series forecasting competition), and it takes the same approach as that competition’s winners. I decided to try forecasting with this model precisely because of the success of ES-RNN (exponential smoothing RNN) models in the M4 competition.

For this tutorial, we’ll use a sample dataset from InfluxDB. InfluxDB allows you to easily import datasets to get started with Flux, the query, scripting and data processing language for InfluxDB. Specifically, we’ll use the NOAA water sample dataset (from the National Oceanic and Atmospheric Administration), which contains environmental data about two creeks in California. To write this data to InfluxDB, navigate to the Explorer page and run a Flux script along these lines (a sketch based on InfluxDB’s sample-data package; adjust the bucket name if yours differs):
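// The set name "noaaWater" and the destination bucket "noaa" are
// assumptions taken from the screenshot caption below.
import "influxdata/influxdb/sample"

sample.data(set: "noaaWater")
    |> to(bucket: "noaa")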

A screenshot from the InfluxDB UI after using Flux to write the NOAA water sample dataset to the bucket “noaa”.

Next, we’ll use the InfluxDB Python Client Library to query data from our InfluxDB instance. First, we import the client library and instantiate it. Then we build our Flux query, query our instance and return a DataFrame. We’ll query for temperature data from Coyote Creek.
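Here’s a minimal sketch of the client setup (the URL, token and org values are placeholders for your own credentials):

import influxdb_client

client = influxdb_client.InfluxDBClient(
    url="https://us-east-1-1.aws.cloud2.influxdata.com",  # your region's URL
    token="YOUR_TOKEN",
    org="YOUR_ORG",
)
query_api = client.query_api()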

We query for data from the last 40 days and then filter for the data we want. We apply an aggregateWindow() function to obtain daily temperature averages; we add this transformation because the pre-trained RNN-GME global model from Kats makes daily forecasts, so our data needs to be daily as well. Next, we use the pivot() function to change the shape of our data so that our resulting DataFrame is in the expected shape. Finally, we drop all of the extraneous columns that we don’t need.
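Putting those steps together, here’s a hedged sketch of the query (the measurement, tag and field names follow the NOAA water sample dataset, and the TimeSeriesData conversion at the end is one way to produce the df_ts used below):

from kats.consts import TimeSeriesData

query = '''
from(bucket: "noaa")
  |> range(start: -40d)
  |> filter(fn: (r) => r._measurement == "average_temperature")
  |> filter(fn: (r) => r.location == "coyote_creek")
  |> aggregateWindow(every: 1d, fn: mean, createEmpty: false)
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
'''
df = query_api.query_data_frame(query)

# Keep only the timestamp and temperature, then build a Kats TimeSeriesData
df = df.rename(columns={"_time": "time"})[["time", "degrees"]]
df_ts = TimeSeriesData(df)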

To forecast with the pre-trained RNN global model, we must first use the load_gmensemble_from_file method to load the pre-trained model. You can get the pre-trained model here.

from kats.models.globalmodel.utils import load_gmensemble_from_file

gme_rnn = load_gmensemble_from_file("pretrained_daily_rnn.p")

Then we specify how many points we want to include in our forecast. We’re forecasting the temperature (the degrees field) at Coyote Creek for the next three days.

fcsts = gme_rnn.predict(df_ts, steps=3)

Next, we convert the results to a DataFrame and combine with the validation portion of our dataset.
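Here’s a hedged sketch of that step (the exact return shape of predict() can vary by Kats version; df_validation is a hypothetical name for the held-out tail of our data):

import pandas as pd

fcst_df = pd.DataFrame(fcsts[0])  # assumes one forecast frame per input series
comparison = pd.concat(
    [df_validation.reset_index(drop=True), fcst_df.reset_index(drop=True)],
    axis=1,
)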

We can see here that our actual values “degrees” are very close to our median prediction values (fcst_quantile_0.5).

This tutorial doesn’t cover every line of code. To see the full script, check out this repo. 

I hope this blog post inspires you to take advantage of Kats and InfluxDB to make forecasts. The benefit of using Kats is that it offers a lot of tools for time-series data science problems. If you want to learn more about how RNN models work, I highly recommend reading this article about RNNs and LSTMs (a type of RNN). I also encourage you to take a look at the following repo, which includes examples of how to work with many of the algorithms described here and InfluxDB to make forecasts and perform anomaly detection.